Which physicochemical attributes are most strongly associated with the perceived quality of red wine?
This dataset contains a total of 1599 rows and 12 columns.
Top 5 rows
For readability, the table below is transposed.
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| fixed.acidity | 7.400 | 7.800 | 7.800 | 11.200 | 7.400 |
| volatile.acidity | 0.700 | 0.880 | 0.760 | 0.280 | 0.700 |
| citric.acid | 0.000 | 0.000 | 0.040 | 0.560 | 0.000 |
| residual.sugar | 1.900 | 2.600 | 2.300 | 1.900 | 1.900 |
| chlorides | 0.076 | 0.098 | 0.092 | 0.075 | 0.076 |
| free.sulfur.dioxide | 11.000 | 25.000 | 15.000 | 17.000 | 11.000 |
| total.sulfur.dioxide | 34.000 | 67.000 | 54.000 | 60.000 | 34.000 |
| density | 0.998 | 0.997 | 0.997 | 0.998 | 0.998 |
| pH | 3.510 | 3.200 | 3.260 | 3.160 | 3.510 |
| sulphates | 0.560 | 0.680 | 0.650 | 0.580 | 0.560 |
| alcohol | 9.400 | 9.800 | 9.800 | 9.800 | 9.400 |
| quality | 5.000 | 5.000 | 5.000 | 6.000 | 5.000 |
Each row of the dataset represents one observation of a red wine. The columns contain objective physicochemical attributes of the wine as well as its average quality score.
Basic Statistics
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| fixed.acidity | 4.600 | 7.100 | 7.900 | 8.320 | 9.200 | 15.900 |
| volatile.acidity | 0.120 | 0.390 | 0.520 | 0.528 | 0.640 | 1.580 |
| citric.acid | 0.000 | 0.090 | 0.260 | 0.271 | 0.420 | 1.000 |
| residual.sugar | 0.900 | 1.900 | 2.200 | 2.539 | 2.600 | 15.500 |
| chlorides | 0.012 | 0.070 | 0.079 | 0.087 | 0.090 | 0.611 |
| free.sulfur.dioxide | 1.000 | 7.000 | 14.000 | 15.870 | 21.000 | 72.000 |
| total.sulfur.dioxide | 6.000 | 22.000 | 38.000 | 46.470 | 62.000 | 289.000 |
| density | 0.990 | 0.996 | 0.997 | 0.997 | 0.998 | 1.004 |
| pH | 2.740 | 3.210 | 3.310 | 3.311 | 3.400 | 4.010 |
| sulphates | 0.330 | 0.550 | 0.620 | 0.658 | 0.730 | 2.000 |
| alcohol | 8.400 | 9.500 | 10.200 | 10.420 | 11.100 | 14.900 |
| quality | 3.000 | 5.000 | 6.000 | 5.636 | 6.000 | 8.000 |
The table above provides some high-level statistics for each variable in the dataset.
The first step towards understanding the relationship between wine quality and physicochemical attributes is to compute the correlations. However, a large correlation matrix is hard to read and decipher, so I created a visualisation for the correlation matrix.
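To make the computation concrete, here is a minimal Python sketch of how such a correlation matrix could be computed from scratch. It is not the code behind this report: the file name `wineQualityReds.csv`, the helper names, and the CSV layout (a header row and numeric columns only) are all assumptions for illustration.

```python
import csv
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for a dict of column name -> list of floats."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def load_columns(path):
    """Read a CSV with a header row into a dict of column name -> list of floats.

    The file layout is an assumption; adjust it to the actual export.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {name: [float(row[name]) for row in rows] for name in rows[0]}


# Hypothetical usage, assuming the data sits in "wineQualityReds.csv":
# wine = load_columns("wineQualityReds.csv")
# print(correlation_matrix(wine)["quality"])
```

Reading off the `quality` row of the result gives the same numbers that a correlation plot encodes as circle size and colour.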
Visualise Correlation Matrix
This correlation matrix visualisation uses both size and colour saturation to represent the magnitude of each correlation, and colour hue to represent its direction.
Using this correlation matrix, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with wine quality. I also found it interesting, and somewhat surprising, that residual sugar shows almost no correlation with quality.
I wonder how good a linear model based on these two variables will be.
| | Dependent variable: quality |
|---|---|
| alcohol | 0.314*** (0.016) |
| volatile.acidity | -1.384*** (0.095) |
| Constant | 3.095*** (0.184) |
| Observations | 1,599 |
| R2 | 0.317 |
| Adjusted R2 | 0.316 |
| Residual Std. Error | 0.668 (df = 1596) |
| F Statistic | 370.379*** (df = 2; 1596) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
The R^2 is only 0.317, which is definitely not good enough.
Linear regression only works well if the relationship is actually linear. Given the poor performance of this simple linear model, I need to turn my attention to non-linear relationships.
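For readers who want to see what sits behind the R^2 figure, the following is a rough Python sketch of an ordinary least-squares fit with an intercept, solved through the normal equations. It is an illustration under my own assumptions, not the routine that produced the regression table: the function names `solve`, `fit_ols` and `adjusted_r_squared` are hypothetical, and a statistics package would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(predictors, y):
    """Least-squares fit of y on the predictor columns plus an intercept.

    Returns (coefficients, r_squared); the intercept is the first coefficient.
    """
    design = [[1.0] + list(row) for row in zip(*predictors)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical usage with three lists of floats of equal length:
# beta, r2 = fit_ols([alcohol, volatile_acidity], quality)
```

Plugging the reported R^2 of 0.317, 1,599 observations and 2 predictors into `adjusted_r_squared` gives roughly 0.316, which agrees with the table above.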
Next, I will explore each physicochemical attribute individually. Using visualisation, I hope to uncover some non-linear relationships.
Histogram of Wine Quality
Wine Quality Frequency
| Quality | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Freq | 0 | 0 | 10 | 53 | 681 | 638 | 199 | 18 |
The vast majority of wines have a rating of 5 or 6. Another 199 wines are rated 7. Only 18 and 10 wines are rated 8 and 3, respectively.
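As a quick check on how concentrated the ratings are, the short Python sketch below turns the frequency table above into shares of the whole sample. The counts are copied from that table; the variable names are just illustrative.

```python
# Frequencies copied from the quality table above (ratings 1 and 2 never occur).
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(quality_counts.values())
shares = {rating: count / total for rating, count in quality_counts.items()}
middle_share = shares[5] + shares[6]

for rating, share in shares.items():
    print(f"quality {rating}: {share:.1%}")
print(f"ratings 5 and 6 combined: {middle_share:.1%}")
```

The counts add up to the 1,599 rows of the dataset, and ratings 5 and 6 together account for roughly 82% of them, so any model has very few examples of clearly good or clearly bad wine to learn from.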
Histogram for fixed.acidity
Density for fixed.acidity
Quality vs fixed.acidity
Fixed acidity has a slightly right-skewed distribution.
Ignoring the wines with the highest rating and judging from the quantiles, one might argue that there is a positive relationship between fixed acidity and quality. However, the variance of fixed acidity is high at every quality rating.
Histogram for Volatile Acidity
Quality vs Volatile Acidity
Boxplot and jittered data points
Volatile acidity has a bimodal distribution, with the first mode around 0.4 and the second around 0.6. The density plot shows that the second mode comes largely from wines with a rating of 5.
Both the density plot and the boxplot show a clear negative relationship: high volatile acidity means low quality. The histogram shows the same relationship, although less obviously than the boxplot. The variance of volatile acidity also increases as the rating decreases.
The correlation between volatile acidity and quality is -0.3905578.
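The per-rating quantiles and variances read off the boxplots can also be tabulated directly. The sketch below is one possible way to do that in Python with the standard `statistics` module; the function name `summarise_by_quality` and the choice of the "inclusive" quantile method are my assumptions, so the quartiles may differ slightly from those drawn in the plots.

```python
import statistics
from collections import defaultdict


def summarise_by_quality(values, qualities):
    """Group an attribute by quality rating and summarise each group.

    Returns {rating: {"n", "q1", "median", "q3", "variance"}}.
    Groups with fewer than two values are skipped, since quartiles and
    sample variance are not meaningful for them.
    """
    groups = defaultdict(list)
    for value, quality in zip(values, qualities):
        groups[quality].append(value)

    summary = {}
    for quality, group in sorted(groups.items()):
        if len(group) < 2:
            continue
        q1, median, q3 = statistics.quantiles(group, n=4, method="inclusive")
        summary[quality] = {
            "n": len(group),
            "q1": q1,
            "median": median,
            "q3": q3,
            "variance": statistics.variance(group),
        }
    return summary


# Hypothetical usage with two equal-length lists taken from the dataset:
# for rating, stats in summarise_by_quality(volatile_acidity, quality).items():
#     print(rating, stats)
```

A median that falls from one rating to the next, together with a variance that shrinks, would be the numeric counterpart of the pattern described above.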
Histogram for Citric Acid
Many wines contain no citric acid at all. Apart from a spike just under 0.5, the distribution is fairly uniform up to 0.5, after which it tails off.
It is quite clear that higher-quality wines tend to have a higher citric acid content.
Histogram for Residual Sugar
Distribution of Residual Sugar by Quality Rating
Quality vs Residual Sugar (limit from 1 to 4)
I made two boxplots this time. The second one is restricted to residual.sugar smaller than 3.5.
Most wines have residual sugar around 2.
As a non-wine drinker, I find it somewhat surprising that there is no visible relationship between residual sugar and quality here, as I would personally prefer a slightly sweeter taste.
Histogram for Chlorides
Density for Chlorides
Quality vs Chlorides
Similar to residual sugar, there are a few outliers with large chloride values in our sample. I generated a second boxplot with 0.2 as the cutoff point.
There is a weak negative relationship between quality and chlorides.
Histogram for free.sulfur.dioxide
Quality vs free.sulfur.dioxide
Free sulfur dioxide has a long-tailed distribution. There is little obvious relationship between quality and free sulfur dioxide.
Histogram for alcohol
Quality vs alcohol
While it is quite clear that high-quality wines (rated 7 or 8) tend to have a higher alcohol content, the picture is less clear for the mid-to-low range. In fact, wines rated 5 have the lowest alcohol content as measured by the quantiles.
The distribution is clearly not normal, and this could also contribute to the poor R^2 of our initial linear model.
The colour of the dots clearly separates the high-quality and low-quality wines. However, there is still a lot of unexplained variance. As shown in the facet grid above, grouping the data by an additional variable improves the separation, but only marginally.
Correlation Matrix
This visualisation gives a very compact view of the correlations between the variables in the dataset. Both the size and the transparency of each circle encode the magnitude of the correlation, and the colour hue represents its direction. Using this visualisation, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with wine quality.
Boxplots of alcohol content for various quality of wine
Given the relatively strong linear correlation that was discovered, I was somewhat surprised to see this boxplot. The lowest alcohol content, judged by the quantiles, appears at a quality rating of 5. It almost seems that the positive linear relationship only applies to mid-to-high quality wines.
Alcohol vs Volatile acidity by wine density and citric acid
Using the divergent colour palette, one can see a clear clustering of lower-quality and higher-quality wines. However, there is still unexplained variation within each cluster.
I started the data exploration by calculating the correlation matrix. In particular, I was interested to see which variables have a strong linear relationship with wine quality. I found that, among all of the variables, alcohol content and volatile acidity have the strongest linear relationships with wine quality.
Using this information, I ran a simple linear regression on the dataset and found that those two variables explain only about 32% of the total variation in quality.
At this point, I suspected that there might be non-linear relationships between quality and some of the variables, so I plotted a histogram and a boxplot against quality for every variable. Somewhat to my surprise, I didn’t find any obvious, strong relationship. And despite the relatively strong correlation, I found that the relationship between alcohol and quality is not that linear.
Finally, I made some scatterplots with quality encoded using a divergent colour palette. There is a clear separation between the high-quality (rating above 5) and low-quality (rating below 5) wines. However, one can also see a lot of noise.
After the exploration, I think it is quite likely that there simply isn’t enough relevant data for accurate prediction. The quality scores are measured as the average of ratings by at least three wine experts. One might argue that when this average is taken over a large number of experts’ ratings, it forms a somewhat objective measurement, thanks to the Central Limit Theorem. However, when only a small number of experts’ ratings go into the quality score, the score becomes very subjective and heavily influenced by the preferences of the individual judges. This is especially true in our case because the ratings are not even derived from the same group of experts for all the wines. One expert might systematically rate wines lower than the other experts, or vice versa.
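The effect of panel size can be made concrete with a small Python sketch. It assumes, purely for illustration, that experts rate independently and that individual ratings scatter around a wine's "true" score with a common standard deviation `sigma`; the value 1.0 used here is hypothetical and is not estimated from the dataset.

```python
import math


def standard_error_of_mean(sigma, n_raters):
    """Standard deviation of the average of n independent ratings with spread sigma."""
    return sigma / math.sqrt(n_raters)


sigma = 1.0  # hypothetical spread of a single expert's rating, in quality points

for n_raters in (3, 10, 30, 100):
    se = standard_error_of_mean(sigma, n_raters)
    print(f"{n_raters:>3} raters: standard error of the average = {se:.3f}")
```

Under these assumptions the noise in the averaged score shrinks only with the square root of the panel size, so a three-person panel leaves most of the individual rater noise in the score. A systematic bias of one expert does not average out at all unless the panel composition is the same for every wine.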
A much better prediction might be possible if more granular details were available in the dataset, for example each expert’s rating of each wine instead of just a simple average.